11/17/2020

Agenda

  • Introduction
  • K-means
  • Hierarchical clustering
  • Principal Component Analysis (PCA)
  • Principal Component Regression (PCR)

Introduction

Unsupervised learning

We have seen a lot of models for \(y \mid x\). This is called supervised learning (we explain the relationship between \(x\) and a special chosen variable \(y\)).

Now we will see models for \(x\) alone.

  • This generally means that analysis goals are not clearly defined and we cannot directly validate the findings (no such thing as RSS)
  • However, there are still many important questions that we can consider:
    • Finding lower-dimensional representations
    • Finding clusters / groups

Clustering methods

  • Clustering consists of dividing the data points into categories that are not defined in advance
  • In classification, by contrast, the categories are defined in advance and we have explicit labels for them

Clustering methods

  • Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set
  • We seek a partition of the data into distinct groups so that the observations within each group are quite similar to each other
  • To make this concrete, we must define what it means for two or more observations to be similar or different
  • This is often a domain-specific consideration that must be made based on knowledge of the data being studied

Partition = mutually exclusive, exhaustive set of nonempty subsets (each data point is in one and only one cluster)

Example: Market Segmentation

  • Suppose we have access to a large number of measurements (e.g. median household income, occupation, distance from nearest urban area, and so forth) for a large number of people
  • Our goal is to perform market segmentation by identifying subgroups of people who might be more receptive to a particular form of advertising, or more likely to purchase a particular product
  • The task of performing market segmentation amounts to clustering the people in the data set

Criteria for clustering

How can we cluster without labels?

  • Data points within the same cluster should be close/similar, and data points in different clusters should be far apart/dissimilar
  • Clusters should (ideally) be balanced

Key fact: we need to know about distances to quantify similarity (within a cluster) and difference (between clusters)

Criteria for clustering

There is no right or wrong answer: how many clusters do you see?


Distance functions

Properties of distance functions:

  1. \(d(x,x^\star) \geq 0\)
  2. \(d(x,x^\star) = 0 \Leftrightarrow x = x^\star\)
  3. \(d(x,x^\star) = d(x^\star, x)\) (symmetry)
  4. \(d(x,x^\star) \leq d(x,x^\prime) + d(x^\prime,x^\star)\) (triangle inequality)

Distance functions

  • Euclidean (\(\ell^2\)): \[d(x,x^\star) = \left[ \sum_{d=1}^D (x_{d} - x^\star_{d})^2 \right]^{1/2}\]

  • Manhattan (\(\ell^1\)): \[d(x,x^\star) = \sum_{d=1}^D |x_{d} - x^\star_{d}|\]
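
A quick NumPy check of these two distances (the points below are made up purely for illustration):

```python
import numpy as np

# Two points in D = 3 dimensions (made-up values, for illustration only).
x      = np.array([1.0, 2.0, 3.0])
x_star = np.array([2.0, 0.0, 3.5])

# Euclidean (l2): square root of the sum of squared coordinate differences.
d_euclidean = np.sqrt(np.sum((x - x_star) ** 2))   # ~2.29

# Manhattan (l1): sum of absolute coordinate differences.
d_manhattan = np.sum(np.abs(x - x_star))           # 3.5
```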

Two clustering methods

  • \(K\)-means clustering: we seek to partition the observations into a pre-specified number of clusters
  • Hierarchical clustering: we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a dendrogram, that allows us to view at once the clusterings obtained for each possible number of clusters, from \(1\) to \(n\)

\(K\)-means

The goal

The idea behind \(K\)-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible.

The within-cluster variation for cluster \(C_k\), denoted \(\text{WCV}(C_k)\), is a measure of the amount by which the observations within the cluster differ from each other.

Hence we want to solve the problem \[\text{minimize}_{C_1 ,\dots,C_K} \left\{ \sum_{k=1}^K \text{WCV}(C_k) \right\}\]

Details

Typically we use the squared Euclidean distance \[\text{WCV}(C_k) = \frac{1}{|C_k|} \sum_{i, i^\prime \in C_k} \sum_{j=1}^p (x_{ij} - x_{i^\prime,j})^2\] where \(|C_k|\) denotes the number of observations in the \(k^{th}\) cluster. Combining the previous equations,

\[\text{minimize}_{C_1 ,\dots,C_K} \left\{ \sum_{k=1}^K \frac{1}{|C_k|} \sum_{i, i^\prime \in C_k} \sum_{j=1}^p (x_{ij} - x_{i^\prime,j})^2 \right\}\]
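
As a sketch (not from the slides), the objective can be written directly in NumPy; `labels` is assumed to be an integer vector giving each observation's cluster:

```python
import numpy as np

def wcv(X, labels, k):
    """Within-cluster variation of cluster k: (1/|C_k|) times the sum of
    squared Euclidean distances over all pairs of observations in C_k."""
    Xk = X[labels == k]                          # observations in cluster k
    diffs = Xk[:, None, :] - Xk[None, :, :]      # all pairwise coordinate differences
    return (diffs ** 2).sum() / len(Xk)

def kmeans_objective(X, labels):
    """The quantity K-means tries to minimize: the sum of WCV over clusters."""
    return sum(wcv(X, labels, k) for k in np.unique(labels))
```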

Algorithm steps


  1. Randomly assign a number, from \(1\) to \(K\), to each of the observations. These serve as initial cluster assignments
  2. Iterate until the cluster assignments stop changing:
    • For each of the \(K\) clusters, compute the cluster centroid. The \(k^{th}\) cluster centroid is the vector of the \(p\) feature means for the observations in the \(k^{th}\) cluster
    • Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance)

\(K\)-means tries to minimize the within-cluster variability
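
A minimal NumPy sketch of the two steps above (it does not handle empty clusters, which a full implementation would need to):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=None):
    """Bare-bones K-means following the algorithm steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: random initial cluster assignments.
    labels = rng.integers(K, size=X.shape[0])
    for _ in range(max_iter):
        # Step 2a: centroids = feature-wise means of each cluster.
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2b: reassign each observation to its closest centroid (Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids
```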

Initializations

The \(K\)-means algorithm is not guaranteed to find a global optimum: different initializations yield different answers.

Label switching: the cluster labels themselves are arbitrary, so the same partition can be reported with the labels permuted.

Initializations

Why can’t we look at all possible clustering configurations? There are too many possible clusterings to check them all!

\[\begin{aligned} A(n, K) &= \text{\# assignments of } n \text{ points to } K \text{ groups} \\ &=\frac{1}{K!} \sum_{j=1}^K (-1)^{K-j} {K \choose j} j^n \end{aligned}\]

\[\begin{aligned} &A(10, 4) = 34105 \\ &A(25, 4) \approx 5 \cdot 10^{13} \text{ (ouch)} \end{aligned}\]
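
A quick numerical check of these counts (the formula is a Stirling number of the second kind; this sketch is not part of the slides):

```python
from math import comb, factorial

def A(n, K):
    """Number of ways to assign n points to K non-empty groups (formula above)."""
    return sum((-1) ** (K - j) * comb(K, j) * j ** n for j in range(1, K + 1)) // factorial(K)

print(A(10, 4))   # 34105
print(A(25, 4))   # 46771289738810, i.e. roughly 5e13
```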

How to solve this issue

  • Multiple restarts
  • Select more than K initial centroids and then choose the most widely separated among these initial centroids
  • Different initialization strategies (e.g. \(K\)-means++)
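
Multiple restarts and the \(K\)-means++ initialization are both available off the shelf; for example in scikit-learn (a usage sketch; `X` is assumed to be an existing \((n, p)\) array of observations):

```python
from sklearn.cluster import KMeans

# n_init repeats the whole algorithm from several initializations and keeps the
# run with the lowest within-cluster sum of squares; init="k-means++" uses the
# initialization strategy described on the next slide.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)
```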

\(K\)-means++

Initialize by choosing \(1\) centroid at random, call it \(m_1\).

Then, for \(k = 2,\dots,K\):

  1. For each point, compute \(d_i\) as the minimum distance of \(x_i\) to the existing centroids
  2. Choose \(x_i\) as initial centroid \(k\) with probability proportional to \(d_i^2\)

Thus points that are far away from existing cluster centroids are more likely to get chosen.

Note: this is just a different initialization strategy. After initialization, it works the same as \(K\)-means.
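
A minimal NumPy sketch of this initialization, assuming Euclidean distance and an \((n, p)\) array `X`:

```python
import numpy as np

def kmeanspp_init(X, K, seed=None):
    """Pick K initial centroids with the K-means++ rule described above."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]             # first centroid: uniform at random
    for _ in range(1, K):
        C = np.array(centroids)
        # d_i^2: squared distance from each point to its nearest existing centroid.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Next centroid chosen with probability proportional to d_i^2.
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)
```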

Pathological example

\(K\)-means and \(K\)-means++ cannot solve all problems…


Hierarchical clustering

Main difference with \(K\)-means

\(K\)-means clustering and basic mixture models require us to pre-specify the number of clusters \(K\).

Hierarchical clustering:

  • is an alternative approach which does not require that we commit to a particular choice of \(K\)
  • produces a set of nested clusters organized hierarchically
  • can be visualized as a “dendrogram”, a tree-like diagram that records the sequences of merges or splits

We describe bottom-up or agglomerative clustering, the most common type of hierarchical clustering; the name refers to the fact that the dendrogram is built starting from the leaves and combining clusters up to the trunk.

Algorithm steps


  1. Start with each point in its own cluster
  2. Identify the “closest” two clusters and merge them
  3. Repeat step 2
  4. Stop when all points are in a single cluster
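
In practice the whole merge sequence can be computed with `scipy.cluster.hierarchy.linkage`, and `dendrogram` draws the resulting tree; a sketch on made-up data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))          # made-up 2-D data, for illustration only

Z = linkage(X, method="complete")     # agglomerative merges with complete linkage
dendrogram(Z)                         # tree of merges; leaves = individual points
plt.show()
```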

Choose the number of clusters

Cutting the dendrogram at a given height yields a clustering with the corresponding number of clusters, e.g.

\[K = 3\]

Choose the number of clusters

\[K = 2\]
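
With the linkage matrix `Z` from the sketch above, the cut corresponding to a chosen \(K\) is a single call to `fcluster`:

```python
from scipy.cluster.hierarchy import fcluster

labels_3 = fcluster(Z, t=3, criterion="maxclust")   # cut giving K = 3 clusters
labels_2 = fcluster(Z, t=2, criterion="maxclust")   # cut giving K = 2 clusters
```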

Types of linkage

What does “closest two clusters” mean? How do we measure similarity/distance between two clusters of points?

Four obvious options: min, max, average, centroids. These are called linkage functions.

Complete linkage

Maximal inter-cluster dissimilarity: compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single linkage

Minimal inter-cluster dissimilarity: compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.

Average linkage

Mean inter-cluster dissimilarity: compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Centroid linkage

Dissimilarity between the centroid for cluster A (a mean vector of length \(p\)) and the centroid for cluster B.

Types of linkage

Each linkage has its own pros and cons:

  • Min: more sensitive to noise and outliers, but will leave large, obvious clusters mostly intact
  • Max: more robust to noise and outliers, but tends to break up large clusters
  • Average and centroid: a compromise between the two
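
In SciPy these options correspond to the `method` argument of `linkage` (min = "single", max = "complete"); again `X` is assumed to be an array of observations, and centroid linkage assumes Euclidean distance:

```python
from scipy.cluster.hierarchy import linkage

# One linkage matrix per linkage function, on the same data.
linkages = {m: linkage(X, method=m) for m in ["single", "complete", "average", "centroid"]}
```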

Applications

Image compression

In a color representation of an image, each pixel is represented as three unsigned integers (ranging from 0 to 255) that specify the red, green and blue intensity values (RGB encoding).

Image compression

An image can contain thousands of distinct colors (out of the \(256^3\) possible RGB combinations). We now want to reduce the number of colors to 20. We can do so by clustering pixels.
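
A hedged sketch of this compression with scikit-learn, assuming `img` is an already-loaded \((H, W, 3)\) RGB array:

```python
import numpy as np
from sklearn.cluster import KMeans

pixels = img.reshape(-1, 3).astype(float)            # one row per pixel: (R, G, B)

km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(pixels)

# Replace every pixel by the centroid of its cluster: the image now uses 20 colors.
compressed = km.cluster_centers_[km.labels_].reshape(img.shape).astype(np.uint8)
```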

Image compression

Image compression

Clustering

Always ask yourself:

  • What am I clustering? The rows of the matrix: in this case, pixels.
  • What are the centroids? Points in the feature space: in this case, colours.

Image clustering

Instead of clustering pixels in groups of pixels sharing similar colours, we now cluster images in groups of images that share similar patterns.

  • Each row of the data is an image, not a pixel
  • Each column is a pixel value, not an RGB value

Hopefully, we get a cluster that is representative of the digit “1”, another of the digit “2”, and so on…
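
As a concrete illustration, using scikit-learn's small 8×8 digits dataset (an assumption here, not necessarily the data shown in the slides):

```python
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

X, y = load_digits(return_X_y=True)     # each row = one digit image, 64 pixel columns

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
# km.labels_ groups similar-looking images; each centroid is an "average digit"
# that can be reshaped to (8, 8) and displayed as an image.
```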

Question time